In this dataset, were are going to be exploring sales information provided.



In [1]:

    
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.









    



Video_Games_Sales_as_at_22_Dec_2016.csv

Let us start by importing some helper libraries and the dataset as well



In [2]:

    
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
from brewer2mpl import qualitative



In [3]:

    
df= pd.read_csv("../input/Video_Games_Sales_as_at_22_Dec_2016.csv")
df.shape









    Out[3]:





(16719, 16)

Now that we know the shape of the dataset, let's have a peek at the data and try to find if they are any missing values.



In [4]:

    
df.head()









    Out[4]:







  
    
      
      Name
      Platform
      Year_of_Release
      Genre
      Publisher
      NA_Sales
      EU_Sales
      JP_Sales
      Other_Sales
      Global_Sales
      Critic_Score
      Critic_Count
      User_Score
      User_Count
      Developer
      Rating
    
  
  
    
      0
      Wii Sports
      Wii
      2006.0
      Sports
      Nintendo
      41.36
      28.96
      3.77
      8.45
      82.53
      76.0
      51.0
      8
      322.0
      Nintendo
      E
    
    
      1
      Super Mario Bros.
      NES
      1985.0
      Platform
      Nintendo
      29.08
      3.58
      6.81
      0.77
      40.24
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2
      Mario Kart Wii
      Wii
      2008.0
      Racing
      Nintendo
      15.68
      12.76
      3.79
      3.29
      35.52
      82.0
      73.0
      8.3
      709.0
      Nintendo
      E
    
    
      3
      Wii Sports Resort
      Wii
      2009.0
      Sports
      Nintendo
      15.61
      10.93
      3.28
      2.95
      32.77
      80.0
      73.0
      8
      192.0
      Nintendo
      E
    
    
      4
      Pokemon Red/Pokemon Blue
      GB
      1996.0
      Role-Playing
      Nintendo
      11.27
      8.89
      10.22
      1.00
      31.37
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN

The cell below shows the data type of the columns



In [5]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
Name               16717 non-null object
Platform           16719 non-null object
Year_of_Release    16450 non-null float64
Genre              16717 non-null object
Publisher          16665 non-null object
NA_Sales           16719 non-null float64
EU_Sales           16719 non-null float64
JP_Sales           16719 non-null float64
Other_Sales        16719 non-null float64
Global_Sales       16719 non-null float64
Critic_Score       8137 non-null float64
Critic_Count       8137 non-null float64
User_Score         10015 non-null object
User_Count         7590 non-null float64
Developer          10096 non-null object
Rating             9950 non-null object
dtypes: float64(9), object(7)
memory usage: 2.0+ MB

The cell below shows the column name and the number of empty rows.



In [6]:

    
df.isnull().sum()









    Out[6]:





Name                  2
Platform              0
Year_of_Release     269
Genre                 2
Publisher            54
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
Global_Sales          0
Critic_Score       8582
Critic_Count       8582
User_Score         6704
User_Count         9129
Developer          6623
Rating             6769
dtype: int64

A list of all the columns in the dataset



In [9]:

    
df.columns.tolist()









    Out[9]:





['Name',
 'Platform',
 'Year_of_Release',
 'Genre',
 'Publisher',
 'NA_Sales',
 'EU_Sales',
 'JP_Sales',
 'Other_Sales',
 'Global_Sales',
 'Critic_Score',
 'Critic_Count',
 'User_Score',
 'User_Count',
 'Developer',
 'Rating']

Calculating the % of missing values



In [10]:

    
df_na= ( df.isnull().sum() / len(df) ) * 100
df_na= df_na.drop(df_na[df_na == 0].index).sort_values(ascending= False)



In [11]:

    
f, ax= plt.subplots(figsize=(12, 8))
plt.xticks(rotation='90')
sns.barplot(x=df_na.index, y=df_na.values)
ax.set(title='Missing Values Plot', ylabel='% Missing')









    Out[11]:





[Text(0,0.5,'% Missing'), Text(0.5,1,'Missing Values Plot')]

Unique Gaming platforms



In [13]:

    
df.Platform.unique()









    Out[13]:





array(['Wii', 'NES', 'GB', 'DS', 'X360', 'PS3', 'PS2', 'SNES', 'GBA',
       'PS4', '3DS', 'N64', 'PS', 'XB', 'PC', '2600', 'PSP', 'XOne',
       'WiiU', 'GC', 'GEN', 'DC', 'PSV', 'SAT', 'SCD', 'WS', 'NG', 'TG16',
       '3DO', 'GG', 'PCFX'], dtype=object)



In [14]:

    
#df.Platform.value_counts()
ssc = df.Platform.value_counts()
f, ax= plt.subplots(figsize=(12, 8))
plt.xticks(rotation='90')
sns.barplot(x=ssc.values, y=ssc.index, orient='h')
ax.set(title='Consoles by count', ylabel='Count')
f.tight_layout()

dropping all NA values



In [15]:

    
df_clean= df.dropna(axis=0)
df_clean.shape









    Out[15]:





(6825, 16)



In [16]:

    
ssc = df_clean.Platform.value_counts()
f, ax= plt.subplots(figsize=(12, 8))
plt.xticks(rotation='90')
sns.barplot(x=ssc.values, y=ssc.index, orient='h')
ax.set(title='Consoles by count after dropping NAs', ylabel='Count')
f.tight_layout()



In [17]:

    
#df['User_Score']= df['User_Score'].convert_objects(convert_numeric=True)
df_clean.User_Score= df_clean.User_Score.astype('float')
#df.User_Score.dtype
#df['User_Score'] = video['User_Score'].convert_objects(convert_numeric= True)









    



/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py:3110: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Plot of the user score v the critic score of games. It appears the users and the critics agree on games with score greater than 8.



In [18]:

    
sns.jointplot(x='User_Score', y='Critic_Score', data=df_clean, kind='hex', cmap='coolwarm', size=7)









    Out[18]:





<seaborn.axisgrid.JointGrid at 0x7f068b56bbe0>

Critic score v critic count. From this plot, we observe that few critics give scores above 80.



In [19]:

    
sns.jointplot(x='Critic_Score', y='Critic_Count', data=df_clean, kind='hex', cmap='plasma', size=7)









    Out[19]:





<seaborn.axisgrid.JointGrid at 0x7f068b4787f0>

CORRELATION BETWEEN COLUMNS



In [22]:

    
stats=['Year_of_Release','NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 
       'Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 
       'Rating']
corrmat = df_clean[stats].corr()

f, ax = plt.subplots(figsize=(10, 7))
plt.xticks(rotation='90')
plt.title('correlation between columns')
sns.heatmap(corrmat, square=True, linewidths=.5, annot=True)









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0689a3c1d0>

Taking a look at Playstation



In [25]:

    
play= df_clean[(df_clean['Platform']== 'PS2') | (df_clean['Platform']== 'PS3')
              | (df_clean['Platform']== 'PS')| (df_clean['Platform']== 'PS4')]
play.shape









    Out[25]:





(2298, 16)

Playststation Global 1994-2016



In [26]:

    
sales_Play= play.groupby(['Year_of_Release', 'Platform'])['Global_Sales'].sum()
sales_Play.unstack().plot(kind='bar',stacked=True, colormap= 'Oranges',  grid=False)
ax.set(title='Playststation Global over the year', ylabel='Cumulative count')









    Out[26]:





[Text(3.2,0.5,'Cumulative count'),
 Text(0.5,1,'Playststation Global over the year')]

Top selling genre for Playstation



In [28]:

    
sales_Play= play.groupby(['Genre', 'Platform'])['Global_Sales'].sum()
sales_Play.unstack().plot(kind='bar',stacked=True, colormap= 'Oranges',  grid=False)









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f068950fb70>

Rating of the games made



In [29]:

    
sales_Play= play.groupby(['Rating', 'Platform'])['Global_Sales'].sum()
sales_Play.unstack().plot(kind='bar',stacked=True, colormap= 'Oranges',  grid=False)









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0689635cc0>

Taking a closer look at Xbox



In [30]:

    
xb= df_clean[(df_clean['Platform']== 'X360') | (df_clean['Platform']== 'XOne')
              | (df_clean['Platform']== 'XB')]
xb.shape









    Out[30]:





(1582, 16)

Global sales of the Xbox consoles globally 1994-2016



In [32]:

    
sales_xb= xb.groupby(['Year_of_Release', 'Platform'])['Global_Sales'].sum()
sales_xb.unstack().plot(kind='bar',stacked=True, colormap= 'Vega20',  grid=False)









    



/opt/conda/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:106: MatplotlibDeprecationWarning: The Vega20 colormap was deprecated in version 2.0. Use tab20 instead.
  warnings.warn(message, mplDeprecation, stacklevel=1)






    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f068b494198>

Top selling genre per Xbox console. The top selling genre is the shooter, which makes sense because of the halo franchise.



In [33]:

    
sales_xb= xb.groupby(['Genre', 'Platform'])['Global_Sales'].sum()
sales_xb.unstack().plot(kind='bar',stacked=True, colormap= 'Vega20',  grid=False)









    



/opt/conda/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:106: MatplotlibDeprecationWarning: The Vega20 colormap was deprecated in version 2.0. Use tab20 instead.
  warnings.warn(message, mplDeprecation, stacklevel=1)






    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f068933a6a0>

Rating and global sales



In [35]:

    
sales_xb= xb.groupby(['Rating', 'Platform'])['Global_Sales'].sum()
sales_xb.unstack().plot(kind='bar',stacked=True, colormap= 'Vega20',  grid=False)









    



/opt/conda/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:106: MatplotlibDeprecationWarning: The Vega20 colormap was deprecated in version 2.0. Use tab20 instead.
  warnings.warn(message, mplDeprecation, stacklevel=1)






    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f06892cadd8>

Taking a closer look at nintendo



In [36]:

    
nintendo= df_clean[(df_clean['Platform']== 'DS') | (df_clean['Platform']== 'Wii')
              | (df_clean['Platform']== 'GC')| (df_clean['Platform']== 'GBA')
                  |(df_clean['Platform']== '3DS') | (df_clean['Platform']== 'WiiU')]
nintendo.shape









    Out[36]:





(1772, 16)

Platform and total global sales from 1994-2016



In [37]:

    
nintendo_sales= nintendo.groupby(['Year_of_Release', 'Platform'])['Global_Sales'].sum()
nintendo_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set1',  grid=False)









    Out[37]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0689213da0>

Genre and total sales on platform. Nintendo looks to be selling alot of sports oriented games, especially on the Wii. However the Wii U is struggling in sales.



In [38]:

    
nintendo_sales= nintendo.groupby(['Genre', 'Platform'])['Global_Sales'].sum()
nintendo_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set1',  grid=False)









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f068912ef28>

Rating and total global sales. Nintendo sold majorly in the category of E (everyone)



In [39]:

    
nintendo_sales= nintendo.groupby(['Rating', 'Platform'])['Global_Sales'].sum()
nintendo_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set1',  grid=False)









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0689173b38>



In [52]:

    
current_gen= df_clean[(df_clean['Platform']== 'Wii') | (df_clean['Platform']== 'X360') | 
                                                        (df_clean['Platform']== 'PS3')]
current_gen.shape









    Out[52]:





(2106, 16)

Comparing the top selling platforms, last generation



In [54]:

    
current_gen_sales= current_gen.groupby(['Year_of_Release', 'Platform'])['Global_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set2',  grid=False)









    Out[54]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f06888f0e80>



In [55]:

    
current_gen_sales= current_gen.groupby(['Genre', 'Platform'])['Global_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set2',  grid=False)









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0688b0e208>



In [56]:

    
current_gen_sales= current_gen.groupby(['Rating', 'Platform'])['Global_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Set2',  grid=False)









    Out[56]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0689362c50>

last generation sales from North America



In [44]:

    
current_gen_sales= current_gen.groupby(['Year_of_Release', 'Platform'])['NA_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Blues',  grid=False)









    Out[44]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0688d9a128>

Last generation sales from Japan



In [45]:

    
current_gen_sales= current_gen.groupby(['Year_of_Release', 'Platform'])['JP_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Blues',  grid=False)









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0688ca14a8>

Last generation sales from EU



In [46]:

    
current_gen_sales= current_gen.groupby(['Year_of_Release', 'Platform'])['EU_Sales'].sum()
current_gen_sales.unstack().plot(kind='bar',stacked=True, colormap= 'Blues',  grid=False)









    Out[46]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f0688c620b8>



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	Name	Platform	Year_of_Release	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	Critic_Score	Critic_Count	User_Score	User_Count	Developer	Rating
0	Wii Sports	Wii	2006.0	Sports	Nintendo	41.36	28.96	3.77	8.45	82.53	76.0	51.0	8	322.0	Nintendo	E
1	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24	NaN	NaN	NaN	NaN	NaN	NaN
2	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.68	12.76	3.79	3.29	35.52	82.0	73.0	8.3	709.0	Nintendo	E
3	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.61	10.93	3.28	2.95	32.77	80.0	73.0	8	192.0	Nintendo	E
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	NaN	NaN	NaN	NaN	NaN	NaN